Pandas in a nutshell

Pandas is a Python library designed for data analysis. Wes McKinney, the author of Python for Data Analysis, began developing Pandas in 2008 while at AQR Capital Management, out of the need for a performant, flexible tool for quantitative analysis of financial data. Before leaving AQR, he convinced management to let him open-source the library.

Another AQR employee, Chang She, joined the effort in 2012 as the second major contributor. Around that time the library became popular in the Python community, and many more contributors joined the project, making it one of the most vital and active data analysis libraries for Python. (Wikipedia)

Pandas can be thought of as the Python equivalent of Microsoft Excel: it abstracts the notion of the spreadsheet, giving the user powerful, robust analytical tools and a way to automate repeated processes.

The twin centerpieces of the Pandas library are the Series and the DataFrame. The Series class is, at its core, a one-dimensional NumPy array surrounded by additional information, such as its index. The DataFrame is conceptually a collection of Series objects, each sharing the same index.


In [ ]:
from pandas import Series, DataFrame
import pandas as pd
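To make that relationship concrete, here is a small sketch (the column names are made up for illustration) showing a DataFrame behaving as a dict of Series objects that share one index:

```python
import pandas as pd

# two Series sharing the same index
s1 = pd.Series([1.5, 1.7, 3.6], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# a DataFrame built from them: one column per Series
df = pd.DataFrame({'x': s1, 'y': s2})

# selecting a column gives back a Series with the shared index
col = df['x']
```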

We will be using Wes McKinney's GitHub notebook as a skeleton. He imports the following libraries for later use:


In [ ]:
from __future__ import division
from numpy.random import randn
import numpy as np
import os
import matplotlib.pyplot as plt
np.random.seed(12345)
plt.rc('figure', figsize=(10, 6))
from pandas import Series, DataFrame
import pandas as pd
np.set_printoptions(precision=4)

Introduction to pandas data structures

Series

Consider the following input:


In [ ]:
obj = Series([4, 7, -5, 3])
obj

We have set the variable obj to reference a new Pandas Series, which we initialized with a Python list as input. Notice that Pandas automatically interprets the input data as type int64, which indicates that it is fairly smart! Also, notice that upon printing obj we see two columns. The first column is the index of the Series, which is presently the natural index, range(4). The second column holds the input data we gave initially, which Pandas refers to as the values of obj. You can access these individually by calling obj.index and obj.values, respectively. For example:


In [ ]:
print obj.index, "\n\n"
print obj.values

Indexing

As previously mentioned, the natural index simply starts at 0 and counts up to the length of the input list. Alternatively, we can specify the index explicitly when we initialize the Series, as in the following:


In [ ]:
obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2

What happens when you examine the index now?


In [ ]:
obj2.index

Don't be alarmed by the u prefix to each of the index values. In Python (as well as in other languages), this simply indicates a Unicode string. For our purposes, there is no practical difference between normal strings and Unicode strings.

You can access a particular member of the Series data by specifying its index. For example,


In [ ]:
obj2['a']

This allows you to change the value of specific entries in your Series data. Additionally, you can extract a sub-Series by specifying a sublist of the index.


In [ ]:
obj2['d'] = 6
obj2[['c', 'a', 'd']]

A powerful tool in Pandas is the ability to concisely access data meeting Boolean qualifications. In the case below, obj2 > 0 is given as the "index," and the output is the sub-Series of obj2 containing only the positive entries.


In [ ]:
obj2[obj2 > 0]

Aside

What is obj2 > 0 actually?


In [ ]:
obj2 > 0

This is actually a neat property of Pandas which is similar to NumPy. In NumPy, suppose you are given an array:


In [ ]:
arr = np.random.rand(5) * 2.0 - 1.0

The actual array itself is given by


In [ ]:
arr

The Boolean array specifying which elements of arr are positive is given by


In [ ]:
boolArr = arr > 0.0
boolArr

Similarly, you can generate a new Series of Boolean values by subjecting the original Series to a Boolean statement, as we did above.

Broadcasting

Like NumPy, we can broadcast arithmetic operations onto Series data. For example,


In [ ]:
obj2 * 2

returns a Series whose values are doubled, and


In [ ]:
np.exp(obj2)

returns a Series whose values have all been subject to the transformation $x\mapsto e^x$. Notice additionally that the dtype of obj2 has automatically been changed from int64 to float64. Again, Pandas is being smart!

Querying a Series

In Python, there is a binary operator called in, which takes two "arguments." The left-hand argument can be any type of data (or object; we won't get into this), while the right-hand argument is some type of iterable object. Then in returns True if the left-hand argument is an element of the right-hand argument. Mathematically, this is equivalent to set membership. For example,


In [ ]:
odds = [i for i in range(20) if i%2 == 1]
print 3 in odds, "|", 2 in odds

is equivalent to noting that if $$\text{Odds} = \{ n : 0 \leq n < 20 \text{ and } n \text{ is odd}\}$$ we have that $$3 \in \text{Odds}$$ while $$2 \notin \text{Odds}.$$

(In fact, we have already seen in in action with Python's for loop, which has the form

for element in iterative_object:

indicating that the code should loop through every element element that is a member of iterative_object.)

You can use in with Pandas Series to test that an element is a member of the index of the Series. For example,


In [ ]:
'b' in obj2

In [ ]:
'e' in obj2

Aside

We have talked about the native Python list data type. There is another important native data type in Python, called a dict, which you can learn more about in the Python documentation. Python dict types are similar to association lists in Scheme, in that they require a lookup key in order to access elements.

Crucially, Pandas can create a Series from a dict by interpreting the key for each item as its corresponding index value, which is actually quite natural. In this sense, I find that it is useful to think of the relationship between NumPy and Pandas as akin to the relationship between a list and a dict.


In [ ]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = Series(sdata)
obj3

What happens when you build a Series from an existing dataset but supply a new index containing a label with no corresponding data?


In [ ]:
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = Series(sdata, index=states)
obj4

In this case, California is a previously-unused index with no corresponding value. Thus, Pandas initializes the new Series with the value for California set to NaN ("not a number," Pandas' marker for missing data).

The isnull method returns a Series of Boolean values indicating wherever the original Series has a null (NaN) value.


In [ ]:
pd.isnull(obj4)

The notnull method does the exact opposite!


In [ ]:
pd.notnull(obj4)

The methods isnull and notnull are "static" in the sense that they can be called straight from the pd module or as methods on a specific Series object.


In [ ]:
obj4.isnull()

Recall the two Series, obj3 and obj4:


In [ ]:
print "\tobj3:\n",obj3, "\n\n\tobj4:\n", obj4

Arithmetic operations between distinct Series objects work conservatively. For data types, int64 + float64 = float64, to preserve the decimal information. The index of the sum is the union of the two indices. Consider the following example:


In [ ]:
obj3 + obj4

No entry for California exists in obj3, while no entry for Utah exists in obj4. Pandas interprets NaN + x = NaN for all x, so the resultant Series sets NaN for both California and Utah.

We can set some metadata for a Series, such as the name of the values column and the name of the index column.


In [ ]:
obj4.name = 'population'
obj4.index.name = 'state'
obj4

You can also completely change the index at any time. We will get into this in more detail later.


In [ ]:
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
obj

DataFrame

As we said before, you can think of a DataFrame as an array of Series objects. Specifically, a DataFrame is a two-dimensional table of Series objects, all sharing the same index. You can also think of a DataFrame as a single Microsoft Excel spreadsheet.

One way to initialize a DataFrame is by giving a dict in which each key maps to a Python list.


In [ ]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)

frame

You can reorder the columns in a new DataFrame using the columns argument:


In [ ]:
DataFrame(data, columns=['year', 'state', 'pop'])

Similarly, the index optional argument in DataFrame allows you to specify the index list. Additionally, adding a debt column with no corresponding data in data will initialize a column filled with NaN entries.


In [ ]:
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                   index=['one', 'two', 'three', 'four', 'five'])
frame2

You can access the columns of a DataFrame as follows:


In [ ]:
frame2.columns

You can slice a particular column by specifying its column name. Notice how this returns a Series.


In [ ]:
frame2['state']

Alternatively, you can slice a column using the following syntax:


In [ ]:
frame2.year

To slice a row, you can specify an index, which will return a Series representing the row at the index.


In [ ]:
frame2.ix['three']
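Aside: in pandas releases since 1.0 the ix indexer has been removed; label-based row selection is spelled loc instead. A minimal sketch of the modern equivalent:

```python
import pandas as pd

frame2 = pd.DataFrame({'year': [2000, 2001, 2002],
                       'state': ['Ohio', 'Ohio', 'Ohio']},
                      index=['one', 'two', 'three'])

# equivalent to frame2.ix['three'] in older pandas
row = frame2.loc['three']
```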

Broadcasting works in the natural way that you might expect:


In [ ]:
frame2['debt'] = 16.5
frame2

You can also give a particular column a list or ndarray, which will then be distributed across the column.


In [ ]:
frame2['debt'] = np.arange(5.)
frame2

Finally, you can assign a Series to a column of a DataFrame. If the Series has an index differing from the main DataFrame, then the unmatched entries of the column will be set to NaN.


In [ ]:
val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
frame2

A hallmark of Pandas is that there are numerous ways to achieve the same effect, so you can pick whichever is easiest for the task at hand. Here is another way to add a column:


In [ ]:
frame2['eastern'] = frame2.state == 'Ohio'
frame2

We can also use Python's del statement to remove a column:


In [ ]:
del frame2['eastern']
frame2.columns

One final way to initialize DataFrame objects is with nested dict objects.


In [ ]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

frame3 = DataFrame(pop)
frame3

You can transpose a DataFrame if it makes more sense to work with the rows and columns flipped.


In [ ]:
frame3.T

Rather than letting Pandas infer the index from the inner dict keys, you can also specify the index manually from the outset.


In [ ]:
DataFrame(pop, index=[2001, 2002, 2003])

DataFrame objects can also be initialized from dicts of Series objects.


In [ ]:
pdata = {'Ohio': frame3['Ohio'][:-1],
         'Nevada': frame3['Nevada'][:2]}
DataFrame(pdata)

In [ ]:
frame3.index.name = 'year'; frame3.columns.name = 'state'
frame3

If you need to access the underlying ndarray from any DataFrame, use the DataFrame.values field.


In [ ]:
frame3.values

In [ ]:
frame2.values

Index objects

The Index is the "metadata" object for Series and DataFrame objects. We've seen ways of initializing Index objects before, so we will go over some features of these objects.


In [ ]:
obj = Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index

Index objects can be sliced like arrays.


In [ ]:
index[1:]

Importantly, Index objects are immutable, so you can't change their values in the natural way; the following assignment raises an exception:


In [ ]:
index[1] = 'd'

You can initialize Index objects with NumPy ndarray objects.


In [ ]:
index = pd.Index(np.arange(3))
obj2 = Series([1.5, -2.5, 0], index=index)
obj2.index is index

In [ ]:
frame3

In [ ]:
print 'Ohio' in frame3.columns, "|", 2003 in frame3.index

Essential functionality

Now that we are familiar with the basic objects in Pandas, we will start working with the mechanics of these objects.

Reindexing

In the previous section we mentioned that Index objects are immutable. Here we will address this issue.


In [ ]:
obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj

The simplest way to change an Index object in an existing Series or DataFrame is with the reindex method.


In [ ]:
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2

In the above example, since "e" was not in the original Index, the corresponding Series value is set to NaN. If you want to change the default fill value, reindex can take an additional parameter, fill_value.


In [ ]:
obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)

A different approach uses a method parameter that fills the new Index from the existing data. One such method is ffill, which carries the existing data forward ("step-fills"); alternatively, bfill carries the data backward.


In [ ]:
obj3 = Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3.reindex(range(6), method='ffill')
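For comparison, here is a small sketch of bfill on the same data; it pulls the next valid value backward instead:

```python
import pandas as pd

obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
filled = obj3.reindex(range(6), method='bfill')
# index 1 takes 'purple' (the next valid value, at index 2);
# index 5 has no later value to pull from, so it stays NaN
```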

The reindex method works for DataFrame objects as well. For DataFrame objects, reindex can also specify column reindexing.


In [ ]:
frame = DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'],
                  columns=['Ohio', 'Texas', 'California'])
frame

In [ ]:
frame.reindex(index=['a', 'b', 'c', 'd'], method='ffill',
              columns=states)

Alternatively, you can use ix to achieve the same effect more concisely.


In [ ]:
frame.ix[['a', 'b', 'c', 'd'], states]
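Note that in newer pandas releases ix is gone, and indexing with labels that don't exist raises an error, so reindex remains the supported spelling for this operation. A small self-contained sketch:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
states = ['Texas', 'Utah', 'California']

# rows and columns absent from the original come back as NaN
result = frame.reindex(index=['a', 'b', 'c', 'd'], columns=states)
```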

Dropping entries from an axis

Suppose you have a Series object with data you wish to remove. Using the drop method, you can specify an index element to remove.


In [ ]:
obj = Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
new_obj = obj.drop('c')
new_obj

You can also drop a list of index elements at once.


In [ ]:
obj.drop(['d', 'c'])

The drop method works the same way for DataFrame objects.


In [ ]:
data = DataFrame(np.arange(16).reshape((4, 4)),
                 index=['Ohio', 'Colorado', 'Utah', 'New York'],
                 columns=['one', 'two', 'three', 'four'])

In [ ]:
data.drop(['Colorado', 'Ohio'])

Additionally, DataFrame.drop() can remove columns by specifying an axis parameter.


In [ ]:
data.drop('two', axis=1)

In [ ]:
data.drop(['two', 'four'], axis=1)

Indexing, selection, and filtering

In this section we will explore the various techniques available for slicing Series and DataFrame objects. On the one hand, we can treat these objects as dict structures, accessing elements by their index keys. On the other hand, we can treat them as list structures, accessing elements by position.


In [ ]:
obj = Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj['b']

In [ ]:
obj[1]

This flexibility allows you to incorporate all of the previous array slicing that worked for NumPy ndarray objects.


In [ ]:
obj[2:4]

Alternatively, you can use a list of dict-style keys to achieve the same end.


In [ ]:
obj[['b', 'a', 'd']]

Here are some alternative slicing techniques for Series objects.


In [ ]:
obj[[1, 3]]

In [ ]:
obj[obj < 2]

In [ ]:
obj['b':'c']

You can assign values to sub-objects, and the changes are reflected in the original object.


In [ ]:
obj['b':'c'] = 5
obj

The same capabilities extend to DataFrame objects, with the added flexibility that the same indexing techniques apply to column slicing as well as index slicing.


In [ ]:
data = DataFrame(np.arange(16).reshape((4, 4)),
                 index=['Ohio', 'Colorado', 'Utah', 'New York'],
                 columns=['one', 'two', 'three', 'four'])
data

In [ ]:
data['two']

In [ ]:
data[['three', 'one']]

The natural slicing will always refer to the index list, not the column list, which is useful to keep in mind.


In [ ]:
data[:2]

In [ ]:
data[data['three'] > 5]

Recall that you can generate a corresponding Boolean array by subjecting a DataFrame to a boolean statement, such as the following:


In [ ]:
data < 5

You can use Boolean arrays to do simple thresholding on your data: isolate the entries meeting a Boolean condition, then manipulate that specific subset of the data.


In [ ]:
data[data < 5] = 0
data

The DataFrame.ix field gives you even more powerful ways to slice your data. In general, slicing works by providing two arguments, an index and a column specification, and it will then return that particular subset.


In [ ]:
data.ix['Colorado', ['two', 'three']]

You can request multiple rows or columns by passing a list of index or column elements. Additionally, you may reorder the indices or columns in your subset by permuting the order of the specified elements, so long as they exist in the original DataFrame.


In [ ]:
data.ix[['Colorado', 'Utah'], [3, 0, 1]]

The ix approach is very powerful. See if you can work through the mechanics of the next few examples to see just how versatile slicing with ix actually is.


In [ ]:
data.ix[2]

In [ ]:
data.ix[:'Utah', 'two']

In [ ]:
data.ix[data.three > 5, :3]

Arithmetic and data alignment

As we mentioned before, we can do arithmetic on Series and DataFrame objects.


In [ ]:
s1 = Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
print s1, "\n\n"
print s2

Importantly, arithmetic is only performed on elements sharing an index. If either object has an index value that the other does not, the arithmetic operation is undefined there, so the resultant object contains a NaN element.


In [ ]:
s1 + s2

The same holds for DataFrame arithmetic, except that alignment now happens on both the index and the columns of each DataFrame object.


In [ ]:
df1 = DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                index=['Ohio', 'Texas', 'Colorado'])
df2 = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                index=['Utah', 'Ohio', 'Texas', 'Oregon'])
print df1, "\n\n"
print df2

In [ ]:
df1 + df2

Arithmetic methods with fill values

Often NaN values are undesirable, as they propagate through arithmetic operations on the data.


In [ ]:
df1 = DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))

print df1, "\n\n"
print df2

In [ ]:
df1 + df2

This can be avoided by using the built-in DataFrame.add() method, which takes as parameters a DataFrame object and an optional fill_value which deals with otherwise NaN entries.


In [ ]:
df1.add(df2, fill_value=0)

In fact, most DataFrame organization methods, such as reindex, take fill_value as a parameter to deal with undefined cases.


In [ ]:
df1.reindex(columns=df2.columns, fill_value=0)

Operations between DataFrame and Series

Broadcasting NumPy arrays is a very useful technique for performing arithmetic operations concisely and efficiently. While normal Python arithmetic is interpreted, NumPy arithmetic is backed by compiled C code, which is much more efficient in general.


In [ ]:
arr = np.arange(12.).reshape((3, 4))
arr

Normally we think of broadcasting a scalar onto a one-dimensional array. In fact, broadcasting is much more powerful: you can broadcast an array over a higher-dimensional array.


In [ ]:
arr[0]

In [ ]:
arr - arr[0]

DataFrame and Series objects work along similar lines.


In [ ]:
frame = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                  index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.ix[0]
print frame, "\n\n"
print series

You can broadcast the values in a Series over its parent DataFrame as you would with NumPy ndarrays.


In [ ]:
frame - series

Of course, if either the Series or the DataFrame has index or column values the other does not, the undefined entries are simply set to NaN. (We discussed ways to avoid this issue in the previous sections.)


In [ ]:
series2 = Series(range(3), index=['b', 'e', 'f'])
frame + series2

In [ ]:
series3 = frame['d']

print frame, "\n\n"
print series3

Using the built-in DataFrame arithmetic operations such as add or sub gives the option to specify the axis (0: index, 1: columns) over which the arithmetic will take place (again, you can use fill_value to avoid potential NaN values).


In [ ]:
frame.sub(series3, axis=0)

Function application and mapping

One of the most important capabilities of Series and DataFrame is the ability to apply function transformations to the data. Every ufunc defined by NumPy can be applied to a DataFrame (or Series) object.


In [ ]:
frame = DataFrame(np.random.randn(4, 3), columns=list('bde'),
                  index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame

For example, you can make every entry nonnegative using NumPy's built-in absolute value function.


In [ ]:
np.abs(frame)

You can define and apply custom functions in two fashions. One is by using lambdas to construct anonymous functions:


In [ ]:
frame.apply(lambda x: x.max() - x.min())

In [ ]:
frame.apply(lambda x: x.max() - x.min(), axis=1)

Alternatively, you can define your own unary function and simply apply it using the same overall approach.


In [ ]:
def f(x):
    return Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f)

For presentations and general readability, it is useful to format decimal or date values into condensed forms, and Pandas lets you achieve this with the applymap method for DataFrame objects (Series objects have the analogous map). The difference between apply and applymap is rather subtle and often functionally negligible, but the idea is that apply works on whole rows or columns, whereas applymap is element-wise.


In [ ]:
format = lambda x: '%.2f' % x
frame.applymap(format)

Alternatively, you can use the built-in Python map function.


In [ ]:
frame['e'].map(format)

Exercise:

Determine which of apply or map is computationally more efficient.
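One way to start on this exercise, sketched with the standard timeit module (the numbers will vary by machine and pandas version, so treat this as a template rather than a verdict):

```python
import timeit

import numpy as np
import pandas as pd

frame = pd.DataFrame(np.random.randn(1000, 3), columns=list('bde'))
fmt = lambda x: '%.2f' % x

# time 50 repetitions of each approach on the same column
t_apply = timeit.timeit(lambda: frame['e'].apply(fmt), number=50)
t_map = timeit.timeit(lambda: frame['e'].map(fmt), number=50)

# both calls produce identical formatted Series; only the timings differ
```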

Sorting and ranking

One fundamental problem in data analysis, and in computer science in general, is sorting data. Pandas provides a number of techniques for sorting the index, the columns, and the data itself.

The first technique is sort_index, which is a method for both Series and DataFrame objects. For Series objects, sort_index works as follows:


In [ ]:
obj = Series(range(4), index=['d', 'a', 'b', 'c'])
obj.sort_index()

Since there is only one meaningful axis to sort here, the index labels, sort_index is a very intuitive method. I want to point out that sort_index has no side effects; that is, calling sort_index on an object does not actually change the internals of the object itself. Instead, a sorted copy of the original object is produced.

The method sort_index works similarly with DataFrame objects, but now there are two potential axes along which to sort. The default is the index, as we see below:


In [ ]:
frame = DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'],
                  columns=['d', 'a', 'b', 'c'])
frame.sort_index()

By specifying the axis as a parameter, one can choose the columns instead. (Recall that in Python everything begins at 0, so the second axis corresponds to axis number 1).


In [ ]:
frame.sort_index(axis=1)

The sort_index method also allows you to flip the ordering by specifying the ascending parameter.


In [ ]:
frame.sort_index(axis=1, ascending=False)

If you want to sort the elements themselves, as opposed to the index, Pandas provides the order method for Series objects.


In [ ]:
obj = Series([4, 7, -3, 2])
obj.order()

By default, NaN values are placed at the end upon sorting the Series.


In [ ]:
obj = Series([4, np.nan, 7, np.nan, -3, 2])
obj.order()

For DataFrame objects you can specify the index or column you wish to sort. Additionally, if your data set is properly constructed, you can sort by two columns or indices, as the below example exhibits:


In [ ]:
frame = DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

frame.sort_index(by='b')

In [ ]:
frame.sort_index(by=['a', 'b'])
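Aside: in newer pandas versions, Series.order and the by parameter of sort_index have been replaced by sort_values, which behaves the same way. A sketch of the modern spelling:

```python
import pandas as pd

obj = pd.Series([4, 7, -3, 2])
ordered = obj.sort_values()                 # modern spelling of obj.order()

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
by_both = frame.sort_values(by=['a', 'b'])  # replaces sort_index(by=...)
```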

Axis indexes with duplicate values

It is possible for a Series or DataFrame object not to have a unique index. For example:


In [ ]:
obj = Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj

Pandas provides a field on the index of any object to indicate whether or not the index is unique (no duplicate labels).


In [ ]:
obj.index.is_unique

If an index is not unique, then slicing the object for a repeated index returns a sub-object. For example:


In [ ]:
obj['a']

For unique index items, the default return-type is a scalar:


In [ ]:
obj['c']

The same goes for DataFrame objects, although they are more complicated.


In [ ]:
df = DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])
df

Remember that you have to slice the index using ix, and you will observe the same behavior.


In [ ]:
df.ix['b']

Summarizing and computing descriptive statistics

Oftentimes, you need a quick way to come up with basic summary statistics of data sets. The solution that Pandas provides is incredibly robust, especially with regard to NaN entries.


In [ ]:
df = DataFrame([[1.4, np.nan], [7.1, -4.5],
                [np.nan, np.nan], [0.75, -1.3]],
               index=['a', 'b', 'c', 'd'],
               columns=['one', 'two'])
df

By default, the sum method will skip NaN entries for each column in a DataFrame.


In [ ]:
df.sum()

For DataFrame objects, you can also sum along either the index axis or the column axis. Again, sum will skip over NaN elements when arriving at a value.


In [ ]:
df.sum(axis=1)

If you don't want this behavior, you can always tell the statistics function you are applying not to skip the NaN entries. Here is an example using mean:


In [ ]:
df.mean(axis=1, skipna=False)

Another useful method is idxmax, which returns, for each column of a DataFrame, the index of the maximum value.


In [ ]:
df.idxmax()

One incredibly useful method is cumsum, which has a number of important applications in the analysis of probability distributions and random walks.


In [ ]:
df.cumsum()

You can also get a quick overview of all of the summary statistics of a DataFrame simply by calling the describe method.


In [ ]:
df.describe()

Descriptive and Summary Statistics

Method            Description
count             Number of non-NaN values
describe          Compute set of summary statistics for Series or each DataFrame column
min, max          Compute minimum and maximum values
argmin, argmax    Compute index locations (integers) of minimum and maximum values
idxmin, idxmax    Compute index labels of minimum and maximum values
quantile          Compute sample quantile ranging from 0 to 1
sum               Sum of values
mean              Mean of values
median            Arithmetic median of values
mad               Mean absolute deviation from mean value
var               Sample variance of values
std               Sample standard deviation of values
skew              Sample skewness (3rd moment) of values
kurt              Sample kurtosis (4th moment) of values
cumsum            Cumulative sum of values
cummin, cummax    Cumulative minimum and maximum of values
cumprod           Cumulative product of values
diff              Compute first arithmetic difference (useful for time series)
pct_change        Compute percent changes
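A quick sketch exercising a few of the methods from the table on a small hypothetical frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1.4, 7.1, np.nan, 0.75],
                   'two': [np.nan, -4.5, np.nan, -1.3]},
                  index=['a', 'b', 'c', 'd'])

counts = df.count()           # non-NaN values per column
spread = df.max() - df.min()  # range of each column
diffs = df['one'].diff()      # first difference; the first slot is NaN
```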

Series objects also have a describe method. The describe method outputs statistics based on the dtype of the underlying object. In the above example, df had a dtype of float64, so describe produced information pertinent to floating-point numerics. In the below example, the Series object has a dtype of object, which results in different summary statistics.


In [ ]:
obj = Series(['a', 'a', 'b', 'c'] * 4)
obj.describe()

Correlation and covariance

One common problem in data analysis, especially in the analysis of time series data like historical prices for financial securities, is correlation and covariance analysis. To this end Pandas has a number of features to make the analysis simple.

Here is one example, using the built-in data reader for Yahoo! Finance in the Pandas API. Returns on a stock are defined as the percent change in the stock's closing value from day to day.


In [ ]:
import pandas.io.data as web

all_data = {}
for ticker in ['AAPL', 'IBM', 'MSFT', 'CSCO']:
    all_data[ticker] = web.get_data_yahoo(ticker)

price = DataFrame({tic: data['Adj Close']
                   for tic, data in all_data.iteritems()})
volume = DataFrame({tic: data['Volume']
                    for tic, data in all_data.iteritems()})

returns = price.pct_change()
returns.tail()
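If the Yahoo! data reader is unavailable (the pandas.io.data module has since moved to the separate pandas-datareader package), you can exercise the same mechanics on made-up prices:

```python
import pandas as pd

# hypothetical closing prices for two tickers over three days
price = pd.DataFrame({'AAPL': [100.0, 102.0, 101.0],
                      'IBM': [180.0, 181.8, 180.0]})

returns = price.pct_change()  # day-over-day percent change
# the first row is NaN: there is no prior day to compare against
```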

Called on a Series, the corr method computes the scalar correlation between that Series and another Series.


In [ ]:
returns.MSFT.corr(returns.IBM)

Similarly, the Series cov method computes the scalar covariance between two Series. Called on an entire DataFrame, corr and cov instead return a full correlation or covariance matrix as a DataFrame, respectively.


In [ ]:
returns.MSFT.cov(returns.IBM)

In [ ]:
returns.corr()

In [ ]:
returns.cov()

The corrwith method computes pairwise correlations and stores the result in a Series. Note that the correlation between IBM and IBM is 1.


In [ ]:
returns.corrwith(returns.IBM)

Passing a DataFrame instead computes correlations between matching column names.


In [ ]:
returns.corrwith(volume)

Unique values, value counts, and membership

Given data with repeats, you can eliminate the excess by using the unique method.


In [ ]:
obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
uniques = obj.unique()
uniques

The value_counts method returns a Series whose index is made up of the unique entries in the original Series, and whose entries give the number of appearances of each value.


In [ ]:
obj.value_counts()

You can perform set-membership operations to, for example, construct masks which you can then apply to your original data.


In [ ]:
mask = obj.isin(['b', 'c']) # This forms a Series object of Boolean values
obj[mask]

For DataFrame objects, you can apply the value_counts method to each subseries, producing a new DataFrame of frequency statistics.


In [ ]:
data = DataFrame({'Qu1': [1, 3, 4, 3, 4],
                  'Qu2': [2, 3, 1, 2, 3],
                  'Qu3': [1, 5, 2, 4, 4]})
data.apply(pd.value_counts).fillna(0)

Handling missing data

One of the primary problems with data analysis is the prevalence of missing data. In many cases, arithmetic operations, summary statistics, and other functions require that your data be intact in order to provide meaningful results. Pandas gives a number of functions to address the problem of missing data, allowing you to filter it out easily.

Consider this Series of string values.


In [ ]:
string_data = Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data

The isnull method identifies every NaN entry. Alternatively, notnull identifies every non-NaN entry.


In [ ]:
string_data.isnull()

In [ ]:
string_data[0] = None
string_data.notnull()

Filtering out missing data

A simple way to remove missing entries from a Series object is to use dropna.


In [ ]:
from numpy import nan as NA
data = Series([1, NA, 3.5, NA, 7])
data.dropna()

Alternatively, you can use Boolean Series and notnull to mask the original data.


In [ ]:
data[data.notnull()]

DataFrame objects are trickier. For example, how should Pandas handle a mostly-complete row? The correct answer is ambiguous. By default, dropna will eliminate any row containing a NaN value (which we imported under the name NA above).


In [ ]:
data = DataFrame([[1., 6.5, 3.], [1., NA, NA],
                  [NA, NA, NA], [NA, 6.5, 3.]])
cleaned = data.dropna()
data

In [ ]:
cleaned

Alternatively, you can require that a row be eliminated only if it is completely empty.


In [ ]:
data.dropna(how='all')

You can also specify columns for deletion. Again, you can change the deletion requirements as needed.


In [ ]:
data[4] = NA # fill a column entirely with NA
data.dropna(axis=1, how='all')

The dropna method is very robust. You can also specify a minimum number of non-NaN entries a row must have to survive. In the next example, we set thresh=2, allowing rows with at most one NaN value to stay while deleting any sparser rows.


In [ ]:
df = DataFrame(np.random.randn(7, 3))
df.ix[:4, 1] = NA; df.ix[:2, 2] = NA
df

In [ ]:
df.dropna(thresh=2)
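
DataFrame.dropna also accepts a subset parameter, restricting the missing-value check to the listed columns; a brief sketch using the earlier data:

```python
from numpy import nan as NA
from pandas import DataFrame

data = DataFrame([[1., 6.5, 3.], [1., NA, NA],
                  [NA, NA, NA], [NA, 6.5, 3.]])
# drop a row only when column 0 is missing; NaNs elsewhere are ignored
print(data.dropna(subset=[0]))
```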

Filling in missing data

Instead of eliminating missing data outright, Pandas lets you fill in the missing values. The simple approach, using fillna, is to pass a value that will then replace every NaN entry.


In [ ]:
df.fillna(0)

Alternatively, you can specify different fill values in different columns by giving a dict with keys of column names.


In [ ]:
df.fillna({1: 0.5, 2: -1})

Using the inplace argument, you can overwrite the original DataFrame object.


In [ ]:
# with inplace=True, fillna modifies df in place and returns None
df.fillna(0, inplace=True)
df

The other main technique is to fill by rule: a forward fill (ffill) propagates the last valid value in each column into subsequent NaN entries.


In [ ]:
df = DataFrame(np.random.randn(6, 3))
df.loc[2::2, 1] = NA; df.loc[4:, 2] = NA
print(df, "\n\n")
print(df.ffill())

In cases where you don't want the fill to extend indefinitely, you can cap it at a maximum number of consecutive NaN entries:


In [ ]:
df.ffill(limit=1)
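
A common variant (a sketch, not from the original notebook) is to fill each missing entry with the mean of the observed values by passing data.mean() to fillna:

```python
from numpy import nan as NA
from pandas import Series

data = Series([1., NA, 3.5, NA, 7.])
# replace each NaN with the mean of the non-missing values
filled = data.fillna(data.mean())
print(filled)
```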

Hierarchical indexing

From "Python for Data Analysis":

Hierarchical indexing is an important feature of pandas enabling you to have multiple (two or more) index levels on an axis. Somewhat abstractly, it provides a way for you to work with higher dimensional data in a lower dimensional form. Let's start with a simple example; create a Series with a list of lists or arrays as the index:


In [ ]:
data = Series(np.random.randn(10),
              index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                     [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])
data

Notice how the outer label a corresponds to the inner labels 1, 2, and 3, each with its own value. The index object is thus not a simple list but a MultiIndex, pairing each outer label with its inner labels.


In [ ]:
data.index

Accessing an outer label will give you the sub-Series that it corresponds to.


In [ ]:
print(data['b'], "\n\n")
print(data['b':'c'])

You can also select on an inner level, which returns a Series of all matching outer labels and their corresponding values.


In [ ]:
data[:, 2]
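
You can also pull out a single value by supplying both levels at once; a sketch with deterministic values in place of the random data:

```python
import numpy as np
from pandas import Series

data = Series(np.arange(10.),
              index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                     [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])
# a tuple of (outer, inner) labels selects one element
print(data.loc[('b', 3)])  # 5.0
```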

You can use unstack to pivot the inner index level into the columns of a DataFrame object.


In [ ]:
data.unstack()

The inverse of unstack is stack. Observe:


In [ ]:
data.unstack().stack()
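
Note that when the inner labels differ across groups, unstack fills the missing combinations with NaN; a sketch with deterministic values:

```python
import numpy as np
from pandas import Series

data = Series(np.arange(10.),
              index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                     [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])
# 'c' lacks inner label 3 and 'd' lacks inner label 1,
# so the wide form contains two NaN entries
wide = data.unstack()
print(int(wide.isnull().values.sum()))  # 2
```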

Multi-indexing has a similar logic with DataFrame objects, but it becomes more complicated as both the index and the columns can be given a hierarchy:


In [ ]:
frame = DataFrame(np.arange(12).reshape((4, 3)),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=[['Ohio', 'Ohio', 'Colorado'],
                           ['Green', 'Red', 'Green']])
frame

For clarity, let's rename the labels so we know what level we are looking at.


In [ ]:
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']
frame

Now, by indexing with a top-level column name, you get the sub-DataFrame of all columns grouped under that name.


In [ ]:
frame['Ohio']

MultiIndex objects are standalone objects in Pandas, meaning that you can create them without a corresponding DataFrame and reuse them as needed.


In [ ]:
from pandas import MultiIndex
MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
                       names=['state', 'color'])
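
For example (a sketch), the same MultiIndex can be passed straight to the DataFrame constructor as its columns:

```python
import numpy as np
from pandas import DataFrame, MultiIndex

# build the column index once, then reuse it
columns = MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'],
                                  ['Green', 'Red', 'Green']],
                                 names=['state', 'color'])
frame = DataFrame(np.arange(12).reshape((4, 3)), columns=columns)
print(frame['Ohio'].shape)  # (4, 2)
```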

Reordering and sorting levels

You can swap two index levels with swaplevel. For example, to interchange key1 and key2, you can write


In [ ]:
frame.swaplevel('key1', 'key2')

Additionally, you can sort the rows by a particular index level. Specify the level by its position (the first level is 0, the second is 1):


In [ ]:
frame.sort_index(level=1)  # sortlevel was removed in modern pandas

Because these methods return new objects, you can chain them into a single statement. For example:


In [ ]:
frame.swaplevel(0, 1).sort_index(level=0)

Summary statistics by level

With hierarchical indexing, you can compute summary statistics over a particular level. To sum all values sharing the same key2 label, group on that level (modern pandas removed the level argument from sum itself in favor of groupby):


In [ ]:
frame.groupby(level='key2').sum()

This extends to the columns as well: transpose so the column level becomes a row level, group, and transpose back.


In [ ]:
frame.T.groupby(level='color').sum().T

Using a DataFrame's columns

In the examples above we showed how to stack and unstack Series objects into DataFrames. But in general DataFrame objects give you a lot of discretion regarding which columns you want to convert into indices.


In [ ]:
frame = DataFrame({'a': range(7), 'b': range(7, 0, -1),
                   'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                   'd': [0, 1, 2, 0, 1, 2, 3]})
frame

You can pass more than one column to set_index to produce a hierarchical index built from the values of each respective column.


In [ ]:
frame2 = frame.set_index(['c', 'd'])
frame2

Crucially, the default behavior is to remove the indexed columns from the data. You can keep them by specifying drop=False:


In [ ]:
frame.set_index(['c', 'd'], drop=False)

The inverse operation, reset_index, moves the index levels back into data columns:


In [ ]:
frame2.reset_index()